This quickstart guide explains how to join two tables A and B using the TF-IDF similarity measure. First, import the required packages as follows (installing py_stringsimjoin automatically installs its dependencies, py_stringmatching and pandas):
In [1]:
# Import libraries
import py_stringsimjoin as ssj
import py_stringmatching as sm
import pandas as pd
import os
import sys
In [3]:
print('python version: ' + sys.version)
print('py_stringsimjoin version: ' + ssj.__version__)
print('py_stringmatching version: ' + sm.__version__)
print('pandas version: ' + pd.__version__)
Joining two tables using the TF-IDF measure typically consists of six steps:
1) loading the input tables
2) profiling the tables to select the join attribute
3) selecting a tokenizer and creating a tokenizer object
4) applying an overlap filter to obtain candidate pairs
5) creating the corpus required for the TF-IDF measure
6) creating and applying the TF-IDF matcher over the candidate pairs
We begin by loading the two tables. For the purpose of this guide, we use the books dataset that comes with the package.
In [4]:
# construct the path of the tables to be loaded. Since we are loading a
# dataset from the package, we need to access the data from the path
# where the package is installed. If you need to load your own data, you can directly
# provide your table path to the read_csv command.
table_A_path = os.sep.join([ssj.get_install_path(), 'datasets', 'data', 'books_table_A.csv.gz'])
table_B_path = os.sep.join([ssj.get_install_path(), 'datasets', 'data', 'books_table_B.csv.gz'])
In [5]:
# Load csv files as dataframes. Since we are reading a compressed csv file,
# we provide the compression argument. If you are reading an uncompressed
# csv file, you can omit the compression argument.
A = pd.read_csv(table_A_path, compression='gzip')
B = pd.read_csv(table_B_path, compression='gzip')
print('Number of records in A: ' + str(len(A)))
print('Number of records in B: ' + str(len(B)))
In [6]:
A.head(1)
Out[6]:
In [7]:
B.head(1)
Out[7]:
Before performing the join, we may want to profile the tables to understand the characteristics of their attributes. This can help identify:
a) unique attributes in the table, which can be used as the key attribute when performing the join. A key attribute is needed to uniquely identify a tuple.
b) the number of missing values in each attribute. This can help you decide which attribute to join on; for example, an attribute with many missing values may not be a good join attribute. The missing-value information also helps you decide how to handle missing values when performing the join.
You can profile the attributes in a table using the following command:
In [28]:
# profile attributes in table A
ssj.profile_table_for_join(A)
Out[28]:
In [26]:
# profile attributes in table B
ssj.profile_table_for_join(B)
Out[26]:
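As a quick complement to the profiler, you can count the missing values in a candidate join attribute directly with pandas (a minimal sketch using the standard isna/sum calls):
In [ ]:
# Count missing values in the 'Title' attribute of each table.
print(A['Title'].isna().sum())
print(B['Title'].isna().sum())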
Based on the profile output, we find that the 'Title' attribute in both tables does not contain any missing values. Hence, for the purpose of this guide, we will join tables A and B on the 'Title' attribute using the TF-IDF measure. Next, we need to decide what threshold to use for the join. For this guide, we will use a threshold of 0.5. Specifically, the join will find tuple pairs from A and B such that the TF-IDF score over the 'Title' attributes is at least 0.5.
Naively, performing the join would involve enumerating the Cartesian product A x B (3022 x 3099 = 9,365,178 pairs) and computing the TF-IDF score for every pair, which can be very time consuming. Instead, we can optimize by first applying an overlap filter over tables A and B to find pairs sharing at least one token in the 'Title' attribute. The intuition is that for the TF-IDF score to be above zero, the two attribute values must share at least one token. Finally, we apply the TF-IDF measure over the surviving candidate pairs to obtain the join output.
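To make this intuition concrete, here is a minimal sketch using plain Python sets (the second title is illustrative, not from the dataset): a pair survives the overlap filter only if the whitespace-token sets of the two titles intersect.
In [ ]:
# Overlap-filter intuition: keep a pair only if the token sets intersect.
tokens_a = set('The Maze Runner Series Complete Collection'.split())
tokens_b = set('The Maze Runner'.split())
print(len(tokens_a & tokens_b) >= 1)  # True -> kept as a candidate pair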
Since the TF-IDF measure treats input strings as bags of tokens, we need to select a tokenizer to convert each string into a bag of tokens. Currently, we support tokenizers from the py_stringmatching package, which provides five tokenizer types: alphabetical tokenizer, alphanumeric tokenizer, delimiter-based tokenizer, qgram tokenizer, and whitespace tokenizer.
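For reference, each tokenizer type corresponds to a class in py_stringmatching; the sketch below shows how each would be instantiated (the delimiter set and q-gram size are illustrative choices, and only the whitespace tokenizer is used in this guide):
In [ ]:
# Illustrative instantiation of the five py_stringmatching tokenizer types.
al_tok = sm.AlphabeticTokenizer()                # runs of letters
an_tok = sm.AlphanumericTokenizer()              # runs of letters and digits
dl_tok = sm.DelimiterTokenizer(delim_set=[','])  # split on chosen delimiters
qg_tok = sm.QgramTokenizer(qval=3)               # character 3-grams
ws_tok = sm.WhitespaceTokenizer()                # split on whitespace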
For the purpose of this guide, we will use a whitespace tokenizer. Once we have selected a tokenizer type, we need to create a tokenizer object as shown below:
In [8]:
# create whitespace tokenizer for tokenizing 'Title' attribute
ws = sm.WhitespaceTokenizer()
ws.tokenize('The Maze Runner Series Complete Collection')
Out[8]:
['The', 'Maze', 'Runner', 'Series', 'Complete', 'Collection']
In [9]:
# create overlap filter with whitespace tokenizer and threshold of 1.
of = ssj.OverlapFilter(ws, 1)
# apply overlap filter to tables A and B to find tuple pairs
# sharing at least 1 token in Title attribute
C = of.filter_tables(A, B, 'ID', 'ID', 'Title', 'Title', n_jobs=-1)
In [10]:
len(C)
Out[10]:
In [11]:
C.head(5)
Out[11]:
If you want to include pairs with missing values in the output, you need to set the allow_missing flag to True when creating the overlap filter as shown below:
In [ ]:
of = ssj.OverlapFilter(ws, 1, allow_missing=True)
Now, when you apply the filter, pairs with missing values will also be included in the output.
The next step is to create the corpus required by the TF-IDF measure. Specifically, the corpus consists of one bag of tokens per value of the 'Title' attribute, drawn from both tables. The corpus can be created as follows:
In [12]:
# create a list of tokens
A_tokens = A['Title'].apply(ws.tokenize).tolist()
B_tokens = B['Title'].apply(ws.tokenize).tolist()
# merge both the lists of tokens to create the corpus
corpus = A_tokens + B_tokens
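The resulting corpus is simply a list of token bags, one per record across both tables, which is the form the TfIdf constructor expects. A quick sanity check (a minimal sketch):
In [ ]:
# The corpus holds one bag of tokens per record of A and B combined.
print(len(corpus) == len(A) + len(B))
print(corpus[0])  # bag of tokens for the first title in A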
Finally, you need to create and apply the TF-IDF matcher as shown below:
In [13]:
# create tf-idf object with the generated corpus
tfidf = sm.TfIdf(corpus, dampen=True)
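Before running the full join, you can sanity-check the matcher on a single pair of strings; this minimal sketch (the shorter title is illustrative) tokenizes two strings and computes their TF-IDF similarity:
In [ ]:
# Compute the TF-IDF similarity for one illustrative pair of titles.
bag1 = ws.tokenize('The Maze Runner Series Complete Collection')
bag2 = ws.tokenize('The Maze Runner Series')
print(tfidf.get_sim_score(bag1, bag2))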
In [14]:
# apply the matcher with a threshold of 0.5. This will find pairs from C
# with TF-IDF score >= 0.5. Setting n_jobs=-1 exploits all CPU cores available.
output_pairs = ssj.apply_matcher(C, 'l_ID', 'r_ID', A, B, 'ID', 'ID', 'Title', 'Title',
ws, tfidf.get_sim_score, 0.5,
l_out_attrs=['Title'], r_out_attrs=['Title'], n_jobs=-1)
In [15]:
len(output_pairs)
Out[15]:
In [16]:
output_pairs.head()
Out[16]:
Similarly, if you want to include pairs with missing values in the output of apply_matcher, set the allow_missing flag to True, as shown below:
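For example (a sketch of the same call as above with only the allow_missing flag added; note that pairs with missing values can only appear here if the candidate set C was itself built with allow_missing=True):
In [ ]:
output_pairs = ssj.apply_matcher(C, 'l_ID', 'r_ID', A, B, 'ID', 'ID', 'Title', 'Title',
                                 ws, tfidf.get_sim_score, 0.5, allow_missing=True,
                                 l_out_attrs=['Title'], r_out_attrs=['Title'], n_jobs=-1)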